1 Introduction

This report demonstrates the preprocessing and sales forecasting task for a large collection of historical sales data, relating to a European drug store chain. The focus is primarily on the code and techniques required to pre-process the data into a state that is usable for later analysis. As such, the structure of this report will first give a descriptive narrative of the data pre-processing performed. Following this, a forecasting analysis will predict future sales figures for a set 6 week future time period, evaluated with the Root Mean Squared Percentage Error (RMSPE).

Data preprocessing is considered an important aspect of data analytics, and one that is often neglected (García, Luengo, and Herrera 2015). In the context of this report, this stage of the analytical task includes the exploration of the data being considered; through direct observations, and visualisations, data cleaning; to remove erroneous data, data transformation; to ensure the existing data provides the best results, and data integration; the inclusion of external data.

This document was written using R Markdown (Allaire et al. 2020; R Core Team 2020), with the Bookdown Package (Xie 2020), and my custom theme based on epuRate (Holtz 2020). Note that code chunks may be folded individually, or collectively using the top right button.

2 Data Preprocessing

This section covers the data preprocessing exercise, performed entirely in the Python programming language (Van Rossum and Drake Jr 1995). The below code chunk will automatically install and load the required R packages for producing this R Markdown document. This chunk also initiates a python virtual environment called a1_sales and installs the python packages required.

if (!require("pacman")) install.packages("pacman")
pkgs <- c(
    "bibtex",
    "rmarkdown",
    "bookdown",
    "knitr",
    "magrittr",
    "reticulate"
)
pacman::p_load(pkgs, character.only = T)
knitr::opts_chunk$set(
    message = FALSE,
    warning = FALSE,
    comment = NA,
    cache = TRUE
)
write.bib(pkgs, "bib/rbib.bib")
write.bib(c("base", "epuRate"), "bib/rbib.bib", append = TRUE)

python_packages <- c(
    "numpy",
    "matplotlib",
    "seaborn",
    "pandas",
    "statsmodels",
    "fbprophet"
)
reticulate::virtualenv_install(
    envname = "a1_sales",
    python_packages
)
## Using virtual environment 'a1_sales' ...

With the R and python environments correctly set up, the python packages required may be imported, for this section these include key libraries for plotting, dataframe manipulation and analysis. This report will use the optional mypy static type checking, which in some cases requires the use of the typing library. The Gadfly Matplotlib theme is used, which is inspired by the Julia Programming Language theme of the same name. The python packages numpy, matplotlib and pandas are all part of the SciPy ecosystem (Virtanen et al. 2020), while statsmodels provides separate functions for statistical and econometric analysis (Seabold and Perktold 2010). fbprophet is a package developed by Facebook researchers which provides forecasting analysis tools, discussed in the later section (Taylor and Letham 2017).

import numpy as np  # numerical operations
import matplotlib.pyplot as plt  # visualisations
import seaborn as sns  # more vis
import pandas as pd  # for working with dataframes
import statsmodels.api as sm  # stats functions

from datetime import date, datetime  # working with datetime types
from scipy.stats import skew  # for reporting skew
from typing import Dict  # type checking


plt.style.use('gadfly')

2.1 Store data

First store.csv was read in, containing information regarding the stores in this analysis. The final two columns in store.csv were empty and unlabelled so were removed. The store.csv file provides variables regarding the 1,115 individual stores considered in this analysis. Table 2.1 gives an overview of the information provided regarding this store level information.

Table 2.1: Descriptive information regarding the store level variables.
Column Description
Store the anonymised store number
StoreType 4 different store models: a, b, c, d
Assortment an assortment level: a = basic, b = extra, c = extended
CompetitionDistance distance in meters to the nearest competitor store
CompetitionOpenSinceMonth the approximate month of the time when the nearest competitor was opened
CompetitonOpenSinceYear the approximate year of the time when the nearest competitor was opened
Promo2 a continuing and consecutive promotion
Promo2SinceWeek the calendar week when the store started participating in Promo2
Promo2SinceYear the year when the store started participating in Promo2
PromoInterval consecutive intervals in which Promo2 is restarted, naming the months

The CompetitionOpenSinceYear and CompetitionOpenSinceMonth variables were combined, making the assumption that all stores opened on the first day of the month, and converted into a single variable with the datetime type called CompetitionDate. This made working with the data as a timeseries easier. Similarly, the Promo2Start variable was created from the Promo2SinceYear and Promo2SinceWeek variables, using a week is decidedly more complex, particularly as the number of weeks in a year is 52.1775, meaning the number of weeks in a year are either considered to be 52, or 53. The International Standards Organisation (ISO) provides the ISO week date system as part of the ISO 8601 datetime standard (ISO 2019), which since pandas 3.8 is accessible through the fromisocalendar method. This method may be applied to every non NULL row in the dataframe to convert the week and year provided into a datetime variable type, with the assumption that the promotion started on the first day of the week (Monday). As Promo2 runs for 3 months, indicated by the data documentation, and repeats every 3 months (PromoInterval variable), it can be assumed that once Promo2 has been initiated, it will always be running. Additionally, the store Assortment type was converted from categorical to dummy variables.

# read in store data, final two columns are untitled and blank
# low memory False excludes errors about multiple types in cols
store = pd.read_csv("~/data/modules/envs801/DA1920_store.csv",
                    low_memory=False).iloc[:, :-2]

# create competition start date as datetime
store['CompetitionDate'] = pd.to_datetime(
    {'year': store['CompetitionOpenSinceYear'],
     'month': store['CompetitionOpenSinceMonth'],
     'day': 1}
)


def week_year_to_datetime(row: pd.Series) -> pd.Series:
    """Convert year and week column to date"""
    return np.datetime64(
        date.fromisocalendar(int(row['Promo2SinceYear']),
                             int(row['Promo2SinceWeek']), 1)
    )


# ensure NA values don't cause error when converting
# year and week to dates
store['Promo2Start'] = store.apply(
    lambda row: week_year_to_datetime(row)
    if row[['Promo2SinceYear',
            'Promo2SinceWeek']].notnull().all() else None, axis=1
)

# convert assortment from categorical to dummy variables
store = pd.get_dummies(store, columns=['Assortment'], drop_first=True)

The CompetitionDistance variable was highly skewed (Skew: ), so the log of the distance was taken, which provided a better distribution, commonly suggested for skewed variables (Ah 2006). The distribution is shown on the Figure 2.1, which also indicates the 50% quantiles.

store['log_competition_dist'] = np.log1p(store['CompetitionDistance'])

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 8))

store['CompetitionDistance'].hist(bins=20, edgecolor='k', ax=axes[0])
axes[0].axvline(store['CompetitionDistance'].quantile(),
                color='k', linestyle='dashed')
axes[0].set_xlabel('Competition Distance (m)')

store['log_competition_dist'].hist(bins=20, edgecolor='k', ax=axes[1])
axes[1].axvline(store['log_competition_dist'].quantile(),
                color='k', linestyle='dashed')
axes[1].set_xlabel('Log Competition Distance')
plt.show()
Distribution of the competition distance in meters, and the log of the competition distance. Black dotted vertical line indicates the 50% quantiles

Figure 2.1: Distribution of the competition distance in meters, and the log of the competition distance. Black dotted vertical line indicates the 50% quantiles

All unused variables were now dropped, leaving the final store level variables given below.

store.drop(['PromoInterval',
            'CompetitionOpenSinceMonth',
            'CompetitionOpenSinceYear',
            'Promo2SinceWeek',
            'Promo2SinceYear',
            'CompetitionDistance'], axis=1, inplace=True)
print(store.columns)
Index(['Store', 'StoreType', 'Promo2', 'CompetitionDate', 'Promo2Start',
       'Assortment_b', 'Assortment_c', 'log_competition_dist'],
      dtype='object')

Only the CompetitionDate, Promo2Start, and log_competition_dist columns now contain null values. This makes sense, and is expected in cases where stores have never had nearby competition, or have never activated Promo 2.

2.2 Test and training data

Following the store data cleaning, the training and testing data was read in. When reading in the CSVs, the Date columns were parsed as the datetime type. These dataframes could be easily matched to store.csv based on the Store variable to which contains unique numerical IDs for each store in the data. The data was limited to the date ranges specified in the overview, to ensure no rogue data was included. During exploratory analysis it was noted that it appeared for some test dates they were using the American date system, these were converted to the ISO date standard and re-included into the dataset.

train = pd.read_csv("~/data/modules/envs801/DA1920_train.csv",
                    low_memory=False,
                    parse_dates=['Date']
                    )
train = train[(train['Date'] >= '2013-01-01') &
              (train['Date'] <= '2015-07-31')]
train = train.merge(store, on='Store')

test = pd.read_csv("~/data/modules/envs801/DA1920_test.csv",
                   low_memory=False,
                   parse_dates=['Date'])

# some dates are for some reason American formatted
test_wrong_dates = test[(test['Date'] < '2015-08-01') |
                        (test['Date'] > '2015-09-17')]

# convert from US to ISO dates
fixed_dates = test_wrong_dates['Date']\
                    .apply(lambda x: datetime.strftime(x, '%Y-%d-%m'))
test_wrong_dates['Date'] = fixed_dates.values

# convert back to datetime
<string>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
test_wrong_dates['Date'] = pd.to_datetime(test_wrong_dates['Date'])

# add fixed dates to known correct dates
test = test.append(test_wrong_dates)

# ensure dates are within specified range
test = test[(test['Date'] >= '2015-08-01') &
            (test['Date'] <= '2015-09-17')]
test = test.merge(store, on='Store')

Table 2.2 gives an overview of the variables provided with both the test and training data. While the test data contains all variables, both the sales and customer columns contain only NA values. The training dataset contains NULL rows, and testing dataset contains NULL rows.

Table 2.2: Descriptive information regarding the timeseries data.
Column Description
Store the anonymised store number
DayOfWeek the day of the week 1 = Monday etc.
Date the given date
Sales the turnover on a given day
Customers the number of customers on a given day
Open an indicator for whether a store is open
Promo whether a store is running a promotion that day
StateHoliday state holidays, including public, easter, Christmas
SchoolHoliday days where public schools are closed

Using the newly created CompetitionDate variable, a store could be determined as having active competition at the date the data was recorded. CompetitionDistance was then converted into a binary variable, NearbyCompetition, considering any open competition nearer than the 50% quantile of all distances to be nearby. This is given by the competition_fix function.

def competition_fix(df: pd.DataFrame) -> pd.DataFrame:
    """Consider nearby competition that is open

    Creates a binary variable to determine whether competition
    is open given the observation date. Selects competitions
    that are both open and within 50th quantile as 1, 0 otherwise.

    Args:
        df (pd.DataFrame): Test or Train dataset in this report
    """
    # Only show competition if open by date of sale
    df['Competition'] = np.where(df['CompetitionDate'] <
                                 df['Date'], 1, 0)
    # nearby competition if open and less than 50% quantile
    df['NearbyCompetition'] = np.where(
        (df['log_competition_dist'] > df['log_competition_dist'].quantile()) &
        (df['Competition'] == 1),
        1, 0
    )
    return df


train = competition_fix(train)
test = competition_fix(test)

The average_spend variable was created by dividing the number of sales by the number of customers. Interestingly for store type b, the daily sales were highest, but customers spent by far the least on average. This is given on Table 2.3.

# new av spend variable
train['average_spend'] = train['Sales'] / train['Customers']

vals = {}
for storetype in train['StoreType'].unique():
    r = train[train['StoreType'] == storetype]
    vals[storetype] = r[['Sales', 'Customers', 'average_spend']].mean()

avg_tab = pd.DataFrame(vals).transpose()
Table 2.3: Average number of sales, customers and average spend by store type.
Sales Customers average_spend
c 5752.341 675.8023 8.636150
a 5763.960 661.1172 8.856334
d 5664.744 502.9213 11.289068
b 10060.707 1986.2849 5.134703

Whether Promotion 2 was active in a particular store at the date was derived and given a binary value, given by the promo_fix function, holidays were also converted to binary variables. Finally, all now unneeded columns were dropped from both the test and training data.

def promo_fix(df: pd.DataFrame) -> pd.DataFrame:
    # promo2 active if since week and year
    df['Promo2Precise'] = np.where(
        df['Promo2Start'] <= df['Date'], 1, 0
    )
    return df


train = promo_fix(train)
test = promo_fix(test)


def convert_holidays(df: pd.DataFrame) -> pd.DataFrame:
    df['StateHoliday'] = np.where(df['StateHoliday'] == '0', 0, 1)
    return df


train = convert_holidays(train)
test = convert_holidays(test)


def drop_cols(df: pd.DataFrame) -> pd.DataFrame:
    df.drop(['log_competition_dist',
             'Promo2',
             'CompetitionDate',
             'Promo2Start',
             'Competition'], axis=1, inplace=True)
    return df


train = drop_cols(train)
test = drop_cols(test)

ONS GDP data was integrated, and used as a proxy considering the increase in spending power across the national population. This data is provided by yearly quarters, so to combine with the existing sales data this had to first be converted to the datetime type.

ons_gdp = pd.read_csv("~/data/modules/envs801/series-100620.csv", skiprows=8,
                      header=None)
ons_gdp = ons_gdp[ons_gdp[0].str.contains('^2013\s', regex=True) |
                  ons_gdp[0].str.contains('^2014\s', regex=True) |
                  ons_gdp[0].str.contains('^2015\s', regex=True)]
ons_gdp[0] = pd.to_datetime(ons_gdp[0].str.replace(' ', ''))
ons_gdp.rename(columns={0: 'Date', 1: 'gdp'}, inplace=True)


def merge_ons(df, ons):
    df = df.sort_values('Date')
    df = df.merge(ons, on='Date', how='left')\
            .fillna(method='ffill')
    return df


train = merge_ons(train, ons_gdp)

# test data uses Q4 2015 for gdp
test['gdp'] = ons_gdp.iloc[-1, 1]

2.3 Stationarity of TimeSeries

Timeseries data is likely to suffer from seasonality. Figure 2.2 shows the weekly seasonality of the data. It is clear that for stores excluding type b there is a drop in Sunday sales, this is likely due to Sunday trading laws.

def group_dates(df: pd.DataFrame, time_period: str) -> pd.DataFrame:
    df.groupby([time_period, 'StoreType'],
               as_index=False)['Sales'].mean()
    return df


weekly_train = group_dates(train, 'DayOfWeek')

train['Month'] = train['Date'].dt.month
monthly_train = group_dates(train, 'Month')

sns.catplot(data=weekly_train, x='DayOfWeek', y='Sales',
            col='StoreType', kind='point', hue='StoreType')
<seaborn.axisgrid.FacetGrid object at 0x7f9bf4392fd0>
plt.show()
Weekly seasonality by store type.

Figure 2.2: Weekly seasonality by store type.

Figure 2.3 gives the monthly seasonality for each store type, as is expected, sales tend to peak towards the end of the year, this is common across all stores, and a common observation in sales data (Hylleberg 1992).

sns.catplot(data=monthly_train, x='Month', y='Sales',
            col='StoreType', kind='point', hue='StoreType')
<seaborn.axisgrid.FacetGrid object at 0x7f9bf338d3d0>
plt.show()
Monthly seasonality by store type.

Figure 2.3: Monthly seasonality by store type.

2.4 Accounting for Trend and Seasonality

The overall trend of the sales per store type is visualised below. There is a clear overall increasing trend, which is particularly noticeable for the b store types.

# visualise overall trend between store types
train.set_index('Date', inplace=True)


def resample_by_col(df: pd.DataFrame, col: pd.Series, rule: str) -> Dict:
    """Allow resampling to different time granularity

    This may be used to resample to months or weeks for example. Providing
    the ability to choose multiple columns to aggregate the mean data by.
    For example store types, and store assortments.
    """
    return {x: df[df[col] == x]
            .resample(rule=rule)['Sales'].mean()
            for x in df[col].unique()}


# resample data to month by storetypes
resampled_storetype = resample_by_col(train, 'StoreType', '1M')
fig, ax = plt.subplots(figsize=(14, 14))
for key, storetype in resampled_storetype.items():
    ax.plot(storetype, label=key)
    ax.legend()
plt.show()
Overall trend of sales by store type, grouped by mean number of sales per month.

Figure 2.4: Overall trend of sales by store type, grouped by mean number of sales per month.

The statsmodels python library provides the seasonal_decompose function which extracts seasonal and trend values for timeseries data. Essentially these capture the fluctuations in sales that happen yearly, e.g. increased sales towards the end of the year, and the overall increasing trend of the sales, as shown on Figure 2.4. These values attempt to account for these changes over time, and as such provide additional variables which may be considered in a linear model. It should be considered that from the documentation for this function, it is noted that this method uses a naive decomposition, and more sophisticated methods should be preferred.

The create_trends function captures this seasonality and trend by store type, which is essential due to the disparity between b stores and others, as shown on Figures 2.4, 2.2, and 2.3.

def create_trends(df: pd.DataFrame, type: str) -> Dict:
    """Obtain statsmodel trends values by storetype
    
    Choose either seasonal or trend from statsmodel, convert to dict
    of storetypes and trend values. Dict then converted to pandas dataframe.
    """
    if type == 'seasonal':
        trend = {keys: sm.tsa.seasonal_decompose(storetype).seasonal
                 for keys, storetype in df.items()}
    elif type == 'trend':
        trend = {keys: sm.tsa.seasonal_decompose(storetype).trend
                 for keys, storetype in df.items()}
    trend = pd.melt(
        pd.DataFrame(trend).reset_index(),
        id_vars='Date',
        value_name=type,
        var_name='StoreType'
    )
    return trend


seasonal_trend = create_trends(resampled_storetype, type='seasonal')
time_trend = create_trends(resampled_storetype, type='trend')

# ffill will fill in gaps from start of month to end by store type
train = train.merge(seasonal_trend, on=['Date', 'StoreType'], how='left')\
    .fillna(method='ffill')

# must also backwards fill this data
train = train.merge(time_trend, on=['Date', 'StoreType'], how='left')\
    .fillna(method='ffill')\
    .fillna(method='bfill')

With all the variables now created, a heat map can now be used to determine the correlation between each, shown on Figure 2.5. Key observations are that Promo gives higher sales, state holidays mean lower sales, and there appears to be a slight drop in sales with Promo2. Interestingly, average spend is negatively correlated with the time trend, suggesting that customers are spending less on average, despite customers being correlated positively with time. Again here the Sunday trading laws are reflected by the reduction in sales for later days in the week (higher values). The extracted trend variable also indicates a strong correlation with the Assortment b type, suggesting that in addition to the categorical store types, the assortment type should be considered as well. The GDP data shows a clear correlation with the trend, implying that sales trends do closely follow GDP.

fig, ax = plt.subplots(figsize=(14, 14))
corr = train.drop(['Open', 'Store'], axis=1).corr()
ax = sns.heatmap(
    corr,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True,
    annot=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)
[Text(0.5, 0, 'DayOfWeek'), Text(1.5, 0, 'Sales'), Text(2.5, 0, 'Customers'), Text(3.5, 0, 'Promo'), Text(4.5, 0, 'StateHoliday'), Text(5.5, 0, 'SchoolHoliday'), Text(6.5, 0, 'Assortment_b'), Text(7.5, 0, 'Assortment_c'), Text(8.5, 0, 'NearbyCompetition'), Text(9.5, 0, 'average_spend'), Text(10.5, 0, 'Promo2Precise'), Text(11.5, 0, 'gdp'), Text(12.5, 0, 'Month'), Text(13.5, 0, 'seasonal'), Text(14.5, 0, 'trend')]
plt.show()
Correlation of all variables in relation to each other for the final cleaned data.

Figure 2.5: Correlation of all variables in relation to each other for the final cleaned data.

This cleaned data was then saved to new CSVs.

train.to_csv('~/data/modules/envs801/derived/train_clean.csv')
test.to_csv('~/data/modules/envs801/derived/test_clean.csv')

3 Timeseries Analysis

This analysis uses the Facebook Prophet library for forecasting (Taylor and Letham 2017). Prophet uses an additive model for considering non linear trends in yearly, weekly, and daily seasonality. Prophet also includes the ability to consider holiday effects, shifts in the trends, outliers and missing data. While the above data preprocessing demonstrates the ability to capture trends and seasonality, it is not necessary to use this information with Prophet, as these are derived automatically, and with more complex and accurate methods. This also means that when predicting future values, trend and seasonality may be considered even when unknown, in addition to the variables provided in the testing data.

First the libraries were loaded in along with the cleaned training and test data.

from fbprophet.plot import plot_cross_validation_metric
from fbprophet.diagnostics import performance_metrics, cross_validation
from fbprophet import Prophet
from typing import Tuple

train = pd.read_csv('~/data/modules/envs801/derived/train_clean.csv',
                    index_col=0)
test = pd.read_csv('~/data/modules/envs801/derived/test_clean.csv',
                   index_col=0)

For this analysis it was determined that it would be simpler to keep only stores which were open, as sales were of interest and any store closed should have no sales. Any day where a large number of stores were closed e.g. holidays, will be reflected in a lower number of total sales. Due to the clear differences between store types, it was appropriate that the analysis should be split. The following section explores the prediction of the future number of sales for the b type of stores only, taking the total number of sales per day across all b stores.

# keep only open stores
# sales will reflect closed stores
train = train[train['Open'] == 1]
test = test[test['Open'] == 1]


def groupby_storetype(df: pd.DataFrame, store_type: str) -> pd.DataFrame:
    """Keep columns present in testing data, group by storetype"""
    gdp = df[df['StoreType'] == store_type]\
            .loc[:, ['Date', 'gdp']]\
            .groupby(['Date'], as_index=False).mean()
    df = df[df['StoreType'] == store_type]\
        .loc[:, ['Date',
                 'Sales',
                 'Promo',
                 'Promo2Precise',
                 'Assortment_b',
                 'Assortment_c',
                 'NearbyCompetition'
                 ]]\
        .groupby(['Date'], as_index=False).sum()
    df['Date'] = pd.DatetimeIndex(df['Date'])
    df['gdp'] = gdp['gdp']
    df.rename(columns={'Date': 'ds',
                       'Sales': 'y'}, inplace=True)
    return df

sales = groupby_storetype(train, store_type='b')
test = groupby_storetype(test, store_type='b').drop('y', axis=1)

The total daily sales for the b stores are given below.

# plot daily sales
ax = sales.set_index('ds')['y'].plot(figsize=(14, 4))
ax.set_ylabel('Daily Number of Sales')
ax.set_xlabel('Date')
plt.show()
Total number of sales per day over the timeseries for the **b** store types.

Figure 3.1: Total number of sales per day over the timeseries for the b store types.

Prophet takes the holidays into account, so these were separated into their own dataframe.

state_dates = train[(train['StateHoliday'] == 1)]['Date'].values
school_dates = train[train['SchoolHoliday'] == 1]['Date'].values

state = pd.DataFrame({'holiday': 'state_holiday',
                      'ds': pd.to_datetime(state_dates)})
school = pd.DataFrame({'holiday': 'school_holiday',
                       'ds': pd.to_datetime(school_dates)})

holidays = pd.concat((state, school))

The model could now be built, the function below provides options for the inclusion of all additional regressions, and whether to run predictive analysis on the test data. This function outputs the model itself, the predicted values, and the Root Mean Square Percentage Error (RMSPE).

def create_model(
        df: pd.DataFrame,
        include_regressors: bool,
        holidays: pd.DataFrame,
        test_data: pd.DataFrame = None) -> Tuple[Prophet, pd.DataFrame, float]:
    """Fit Prophet models with adjustable params

    Allows for inclusion of test data, holidays, and regressors 
    in Prophet model. Function will fit the model to output the model, 
    forecasted values, optionally including future dates, and the global
    root mean square error.

    Args:
        df (pd.DataFrame): Containing y output var, optionally regressors
        include_regressors (bool): Optionally include all regressors
        holidays (pd.DataFrame): Contains dates for all holidays
        test_data (pd.DataFrame, optional): future dates, optional regressors

    Returns:
        Tuple[Prophet, pd.DataFrame, int]: Model, forecast values and RMSPE.
    """
    model = Prophet(holidays=holidays,
                    daily_seasonality=False)

    if include_regressors:
        for col in df.columns[2:]:
            model.add_regressor(col)

    model.fit(df)

    if test_data is not None:
        forecast = model.predict(df.append(test_data))
    else:
        forecast = model.predict(df)

    fc = forecast.merge(df[['ds', 'y']], on='ds', how='left')
    fc_drop = fc.dropna()
    # values lower than zero = 1 for rmse calcs
    fc_drop['yhat'] = np.where(fc_drop['yhat'].values <= 0, 1, fc_drop['yhat'].values)
    fc_drop['y'] = np.where(fc_drop['y'].values <= 0, 1, fc_drop['y'].values)

    def find_rmspe(y_true: pd.Series, y_pred: pd.Series) -> float:
        return np.sqrt(
            np.mean(np.square(((y_true - y_pred) / y_true)), axis=0)
        ) * 100

    rmspe = find_rmspe(fc_drop['y'], fc_drop['yhat'])
    return model, fc, rmspe

Two initial models were ran, one with regressors, and one without inclusion of regressors. Table 3.1 shows that the inclusion of regressors gave a small improvement in the RMSPE so this model will be used for the final analysis.

As the predictions were made using the total collectives sales of all b stores, the regressors included provided the sum of all stores with that particular feature. These regressors include Promo, the sum of all stores running the first promotion on a particular date, Promo2Precise, the sum of all stores running the second promotion, and NearbyCompetition, the sum of all nearby competition. Also included are Assortment_b, and Assortment_c, as the sum of all stores that operate with their respective assortment. In this case, variations in the numbers reflect the closure of particular stores on given days, allowing for the number of open stores of particular assortment types to be indicative of fluctuations in sales. Assortment a stores are indicated when assortment b and c took values lower than the total number of stores. The quarterly GDP provided by ONS is self explanatory, and purely provides the national GDP at that date, as a proxy that may represent national trends in overall sales.

These regressors represent the maximum amount of additional information that may be provided to the model on a day to day basis using the chosen method.

model1, fc1, rmspe1 = create_model(sales,
                                   include_regressors=True,
                                   holidays=holidays)
model2, fc2, rmspe2 = create_model(sales,
                                   include_regressors=False,
                                   holidays=holidays)

cols = ['Model 1 (Inc. Regressors)', 'Model 2 (Exc. Regressors)']

rmspe_tab = pd.DataFrame({'Model': ['Model 1 (Inc. Regressors)',
                                    'Model 2 (Exc. Regressors)'],
                          'RMSPE': [rmspe1, rmspe2]})
Table 3.1: Root Mean Squared Percentage Errors (RMSPE) for Model 1 and Model 2
Model RMSPE
Model 1 (Inc. Regressors) 12.29847
Model 2 (Exc. Regressors) 15.49807

The more effective model was then used to predict values from the test data. The result of this is given visually on Figure 3.2, the values following the dotted line indicate the new sale predictions.

# model including test data
model, fc, rmspe = create_model(sales,
                                include_regressors=True,
                                holidays=holidays,
                                test_data=test)
<string>:39: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
<string>:40: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
fig, ax = plt.subplots(figsize=(16, 8))
model.plot(fc, ax=ax)
plt.axvline(x=test['ds'].iloc[0], color='k', linestyle='dashed')
plt.show()
Prophet model predicted values for existing data and future predicted data. Dark blue line shows estimated values, lighter blue upper and lower estimates. Black dots are true data points, and black vertical dashed line shows cutoff between real data and predicted.

Figure 3.2: Prophet model predicted values for existing data and future predicted data. Dark blue line shows estimated values, lighter blue upper and lower estimates. Black dots are true data points, and black vertical dashed line shows cutoff between real data and predicted.

In order to test the robustness of the model, Prophet provides some cross-validation functions. Table 3.2 gives an overview of the results produced by this metric, with errors for different horizons. Each horizon is the number of days following the cutoff from the data used to make the prediction. Table 3.2 interestingly shows that despite larger horizons, the model appears to stay robust.

# prophet crossvals
df_cv = cross_validation(model1, horizon='180 days')
INFO:fbprophet:Making 3 forecasts with cutoffs between 2014-08-05 00:00:00 and 2015-02-01 00:00:00
df_p = performance_metrics(df_cv)
performance = df_p.head(3).append(df_p.tail(3))
performance['horizon'] = performance['horizon'].dt.days.astype(str)
Table 3.2: Performance metrics for different horizons for the Prophet forecasting model. Horizon indicates the number of days used to make the prediction.
horizon mse rmse mae mape mdape coverage
0 18 452887404 21281.15 15980.54 0.0925248 0.0642111 0.8402778
1 19 392540866 19812.64 15340.50 0.0911221 0.0672130 0.8472222
2 20 341140653 18469.99 14479.23 0.0890486 0.0642111 0.8611111
156 178 331120114 18196.71 13790.07 0.0831556 0.0677499 0.8125000
157 179 318345264 17842.23 13258.11 0.0805320 0.0644810 0.8125000
158 180 345809329 18595.95 13804.88 0.0844302 0.0644810 0.7916667

Figure 3.3 visualises the results of this cross validation, showing a fairly consistent 10% Mean Absolute Percentage Error (MAPE), for each horizon. There are several large outliers present, these could be reflective of days where large stores are closed for refurbishment, as indicated in the brief.

plot_cross_validation_metric(df_cv, metric='mape')
plt.show()
Mean Absolute Percentage Error (MAPE) for cross validation across multiple horizon windows.

Figure 3.3: Mean Absolute Percentage Error (MAPE) for cross validation across multiple horizon windows.

4 Conclusion

It is essential to understand the data being considered prior to running any analytical task. While the data used in this report concerns stores operated by the same pharmaceutical company, there are significant differences in the way individual stores are operated. The analyses identifies key differences in the trading of b stores compared with a, c, and d stores, as the b stores are permitted to remain open on Sundays. Key differences between these store types are also reflected by the much larger average number of customers for b stores, but with lower average sales per customer. It is clear that b stores represent a much smaller number of stores for this company, but due to their large size, account for a significant proportion of overall sales. This report chose to focus particularly on the analysis of b store due to their increasing sales trend over time, indicating that it is likely the focus on these stores provides the most financial incentive.

While this report explores the use of additional regressors in the forecasting analysis, it is apparent that these provide only a slight improvement to the baseline model. This highlights the widely accepted importance of considering seasonal trends in forecasting analysis (Holt 2004; Hylleberg 1992), something which is hard to capture in traditional regression.


Word count: 2367

References

Ah, Studenmund. 2006. “Using Econometrics: A Practical Guide.”

Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2020. Rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.

García, Salvador, Julián Luengo, and Francisco Herrera. 2015. Data Preprocessing in Data Mining. Vol. 72. Intelligent Systems Reference Library. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-10247-4.

Holt, Charles C. 2004. “Forecasting Seasonals and Trends by Exponentially Weighted Moving Averages.” International Journal of Forecasting 20 (1): 5–10. https://doi.org/10/fg269b.

Holtz, Yan. 2020. EpuRate: A Clean Template for R Markdown Documents.

Hylleberg, Svend. 1992. Modelling Seasonality. Oxford University Press.

ISO. 2019. “ISO 8601-2:2019(En), Date and Time Representations for Information Interchange Part 2: Extensions.” https://www.iso.org/obp/ui/#iso:std:iso:8601:-2:ed-1:v1:en.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Seabold, Skipper, and Josef Perktold. 2010. “Statsmodels: Econometric and Statistical Modeling with Python.” In Python in Science Conference, 92–96. Austin, Texas. https://doi.org/10/ggq6ff.

Taylor, Sean J, and Benjamin Letham. 2017. “Forecasting at Scale.” Preprint. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.3190v2.

Van Rossum, Guido, and Fred L Drake Jr. 1995. Python Tutorial. Vol. 620. Centrum voor Wiskunde en Informatica Amsterdam.

Virtanen, Pauli, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, et al. 2020. “SciPy 1.0–Fundamental Algorithms for Scientific Computing in Python.” Nature Methods 17 (3): 261–72. https://doi.org/10/ggj45f.

Xie, Yihui. 2020. Bookdown: Authoring Books and Technical Documents with R Markdown. https://github.com/rstudio/bookdown.